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Abstract 

We present a novel approach to the regression of 
quantum mechanical energies based on a scatter¬ 
ing transform of an intermediate electron den¬ 
sity representation. A scattering transform is a 
deep convolution network computed with a cas¬ 
cade of multiscale wavelet transforms. It pos¬ 
sesses appropriate invariant and stability prop¬ 
erties for quantum energy regression. This new 
framework removes fundamental limitations of 
Coulomb matrix based energy regressions, and 
numerical experiments give state-of-the-art accu¬ 
racy over planar organic molecules. 

1. Introduction 

Estimating the ground state energy of atoms and molecules 
is one of the most fundamental and studied topics in com¬ 
putational quantum mechanics. The traditional approach to 
this problem has been to devise clever ways to solve for 
the equations of quantum mechanics. Recently though, it 
has been proposed to attack the problem from a Machine 
Learning (ML) perspective (Rupp et al., 2012). 

Most machine learning approaches are representing the 
molecular state as a Coulomb matrix of pairwise energy 
terms (Rupp et al., 2012; Hansen et al., 2013). An impor¬ 
tant limitation of a Coulomb representation is that it de¬ 
pends on an ordering of the atoms in the molecule. When 
the atom ordering is changed, the Coulomb matrix changes 
while the energy does not. 

A first contribution of this paper is to introduce a new 
molecular representation in the form of a two or three di¬ 


mensional signal. We define a one-to-one mapping be¬ 
tween the molecular state and a real-valued positive func¬ 
tion defined over or K^, which has the physical inter¬ 
pretation of an approximate electron density. This first step 
circumvents the issue of atom ordering. In numerical ap¬ 
plications, we restrict ourselves to planar molecules, with 
atoms lying on the same molecular plane. We will there¬ 
fore use two-dimensional electron densities, but three di¬ 
mensional extensions are calculated similarly. 

Regression of a high dimensional functional requires the 
use of prior knowledge of its invariance and stability prop¬ 
erties. For quantum energy regression, many invariance 
and stability properties of the energy function are known. 
Imposing these same properties onto a representation is im¬ 
portant to obtain accurate regressions. Scattering trans¬ 
forms introduced by (Mallat, 2012) are examples of con¬ 
volutional networks (Krizhevsky et al., 2012; Sermanet 
et al., 2013), computed with iterated wavelet transforms, 
which yield appropriate invariants. Our second contribu¬ 
tion shows that these scattering transforms, successfully 
used for image classification (Bruna & Mallat, 2013; Oyal- 
lon & Mallat, 2014), can be used to regress quantum ener¬ 
gies to state-of-the-art accuracy. 

The paper is organized as follows. Section 2 discusses 
the properties of ’’good” molecular representations, and 
presents the current state-of-the-art along with its known 
limitations. Section 3 introduces the problem of regressing 
energies through sparse linear expansions over dictionar¬ 
ies of density functionals. Section 4 gives the mathemati¬ 
cal details of a number of invariant representations used in 
this work. Section 5 describes the setup of the numerical 
experiments along with the values of all the numerical pa¬ 
rameters used in generating the dictionaries in 4. Finally, 
Section 6 analyses the results of our experiments. 
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2. Molecular representations for energy 
regression 

High dimensional regressions must take advantage of prior 
information and invariances of the function that is esti¬ 
mated in order to reduce the problem dimensionality. 

We start by outlining properties that should be satisfied 
by a molecular representation in an energy regression con¬ 
text. Next, the current best-in-class representation based on 
Coulomb matrices will be presented, along with its known 
limitations. 

2.1. Desirable properties for a molecular 
representation 

We are interested in regressing molecular atomization en¬ 
ergies. A molecule containing K atoms is entirely defined 
by its nuclear charges Zk and its nuclear position vectors pk 
indexed by k. Denoting by x the state vector of a molecule, 
we have 

X = {{pk,Zk) G X M : fc = 1,.. .,K}. 

Since the target value that we are trying to regress is a scalar 
representing a physical energy, we know that: 

Permutation invariance The energy is invariant to the 
permutation of the indexation of atoms in the 
molecule. 

Isometry invariance The energy is invariant to transla¬ 
tions, rotations, and symmetries of the molecule and 
hence to any orthogonal operator. 

Deformation stability The energy is differentiable with 
respect to the distances between atoms. 

Multiscale interactions The energy has a multiscale 
structure, with highly energetic covalent bonds be¬ 
tween neighboring atoms, and weaker energetic ex¬ 
changes at larger distances, such as Van Der Waals 
interactions. 

The deformation stability stems from the fact that a small 
deformation of the molecule induces a small modification 
of its energy. The primary difficulty is to construct a rep¬ 
resentation which satisfies these four properties, while si¬ 
multaneously containing a rich enough set of descriptors to 
accurately regress the atomization energy of a diverse col¬ 
lection of molecules. 

2.2. Coulomb matrix representation 

Representations of distributions of points, which are invari¬ 
ant to orthogonal transformations and stable to deforma¬ 
tions can be defined from pairwise distances between these 


points. This is the strategy adopted by the current state- 
of-the-art in molecular energy regression, which makes use 
of a so-called Coulomb matrix representation (Rupp et al., 
2012; Hansen et al., 2013). Given a state vector x, the 
Coulomb matrix representation C is a function of the pair¬ 
wise distances \pk — Pi\ and of the charge products ZkZ^ 

k = e, 

k^l. 

This representation thus satisfies the isometry invariance 
and deformation stability properties. However, it is not 
invariant to the permutation of the atom indices, which 
is a priori arbitrary. Although (Hansen et al., 2013) pro¬ 
poses many strategies to mitigate this problem, it remains 
a challenge to this day. The most successful strategy is 
to augment the data set by associating to each molecule 
several permutations of its Coulomb matrix. The final pre¬ 
dicted energy is then the average of the predicted energy 
for each permutation. While this technique improves per¬ 
formance, the data augmentation can significantly increase 
the size of the data set. In the context of kernel ridge regres¬ 
sion, which achieves some of the best reported numbers for 
Coulomb matrices, this means that the size of the kernel 
can be very large. Furthermore, the fact that the size of 
the Coulomb matrix depends on the number of atoms K in 
the molecule is another limitation. To remedy that issue, 
a fixed maximum size is set a priori, and small Coulomb 
matrices are padded with zeros on the remaining rows and 
columns. However, once the training phase is complete, 
this approach effectively sets an upper bound on the molec¬ 
ular size supported by the representation. Finally, while the 
Coulomb matrix representation features multiple molecular 
length scales, it treats them on an equal footing. In particu¬ 
lar, it does not take advantage of the multiscale structure of 
the energy, that emphasizes some scales more than others. 
It is possible though that the highly nonlinear regressors 
that couple with Coulomb matrices make up for this short¬ 
coming. Numerical results seem to confirm that intuition, 
as linear regression with Coulomb matrices is an order of 
magnitude worse. 

3. Energy regression from electronic densities 

Hohenberg and Kohn proved in (Hohenberg & Kohn, 1964) 
that the molecular energy E can be written as a functional 
of the electron density p{u) > 0 which specifies the density 
of electronic charge at every point u G . The minimiza¬ 
tion of E{p) over a set of electron densities p leads to the 
calculation of the ground state energy 

f{x) = E{p^) = inf E{p) . 

p 

Hohenberg and Kohn also proved that there is a one to one 
mapping between Px{u), the minimizing density, and x. 


Ck,i = 


1 ^ 2.4 
2^k ’ 

ZkZl 
\Pk -1 
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In this work, we consider neutral molecules for which the 
total electronic charge is equal to the sum of the protonic 
charges Zk so that J px{u)du = J^k Computing is 
as difficult as computing E{px). The next section explains 
how to replace p^ by an approximate density while section 
3.2 describes a sparse linear regression of E. 

3.1. Electronic density approximations 

We use the spirit of the electron density approach by 
representing a; by a crude approximate density p^ of p^ 
which also has a one to one mapping with x and satisfies 
/ Px{u)du — By construction, this approximate 

density is invariant to permutations of atom indices k and 
its expression is given by; 

K 

hiu) = ^pt^'"\u-pk), ( 1 ) 

k=l 

which represents a simple linear superposition of isolated 
atomic densities. The notation a{k) is a shorthand for the 
chemical nature of atom k which determines its nuclear 
charge Zk, and hence which atomic density should be sub¬ 
stituted. Isolated atomic densities are pre-computed from 
Density Functional Theory calculations for every distinct 
atomic species present in a molecular database. The elec¬ 
tron density model (1) only gives a crude approximation 
to px- It is a sum of independent atomic contributions and 
hence does not model any chemical effects like bond shar¬ 
ing. An example of an exact and approximate electron den¬ 
sity is shown on Figure 1. The effect of bond sharing in px 
appears as higher density ’’bridges” between atoms. These 
bridges are almost entirely absent in the case of px- 



Figure 1. (left) Ground state electron density and (right): Ap¬ 
proximate electron density px from 1. 


In this work, molecules are bi-dimensional. As a conse¬ 
quence, we transform the three dimensional density in (1) 
into a two dimensional one, and use that later as our in¬ 
termediate representation. The dimensionality reduction 
in the density is obtained by replacing each of the three 
dimensional atomic densities with two dimensional ones. 


That last transformation is obtained by ’’condensing” the 
three dimensional charge of a spherical shell of radius r and 
width dr onto an annulus with the same radius and width. 

The resulting approximate density representation px is in¬ 
variant to permutations of atom indices but it is not invari¬ 
ant to isometries nor is it multiscale. The regression of the 
molecular energy E{px) is therefore not computed from px 
but from a representation ^{px) which satisfies these prop¬ 
erties. Section 4 will explain how to construct such repre¬ 
sentations ^{px) = {(l)k{px)}k, while the next section ex¬ 
plains how to use them and approximate the energy E{px) 
from a sparse linear regression calculated from a data set of 
training examples; 


f{x) = E{px) = ^ Wk(j)k{px) ■ (2) 

k 


3.2. Sparse Regression by Orthogonal Least Square 

Given a training set of N molecular state vectors and as¬ 
sociated energies {xi , f{xi)}i<i<N, we explain how to 
compute the sparse regression in (2). To simplify notations, 
we shall write (j)k{px) = 4>k{x). A sparse M term regres¬ 
sion is obtained by selecting M functionals {(/)fe„}i<m<M 
and computing an optimized linear regression 

M 

f&iix) = ^ Wm(l>kr^ix), 
m—1 


which minimizes the quadratic error on training examples 


M 


E 


Wm4>k^iXi) - f{Xi) 

m—1 


2 


(3) 


These M functionals are selected with a greedy orthogo¬ 
nal least square forward selection algorithm (Chen et al., 
1991). The procedure selects and orthogonalizes each 
functional, one at a time. 

At the iteration it selects (j)k^, and a Gram-Schmidt 
orthogonalization yields an orthonormalized , which is 
uncorrelated relative to all previously selected functionals: 


Vn < m, 


4>kA^i) = 0 , 

i 

and y]] = 1. 


The dictionary element (p-j: is selected so that the linear 
regression of / over {<p-j: }i<n<m minimizes the quadratic 
error \fm{xi) — f{xi)\‘^. This is equivalent to finding 
so that fi^i) 4’km(^i) maximized. The algo¬ 
rithm can be implemented with a QR factorization, as de¬ 
scribed in (Blumensath & Davies, 2007). The M-term re¬ 
gression can then be written as a function of the original 
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functionals but it is more easily expressed in the or- 
thogonalized Gram-Schmidt basis {(j>^ }m<M' 

M 

fM{x) = w:^(l)Yx) 

m—1 

where each coefficient is the correlation: 

i 

The parameter M is the dimension of the regression model. 
Increasing M reduces the bias error but also increases the 
variance error. The optimal M results from a bias-variance 
trade-off. It is estimated with a cross validation over train¬ 
ing examples. 

The bias error is the minimum approximation error of f{x) 
from an M term linear combination of dictionary vectors 
{4>k{x) = 4)k{px)}k- This error is small if f{x) is well ap¬ 
proximated by an element of a space spanned by M terms 
of the dictionary. This can be true only if the functionals 
4>k{px) have the same invariance and stability properties as 
f{x). The next section explains how to construct such a 
dictionary. 

4. Invariant representations 

The central issue is to define a dictionary ^{px) = 

{<pkiPx)}k which is invariant to isometries, stable to defor¬ 
mations, multiscale, and sufficiently rich to perform an ac¬ 
curate regression of the energy f{x) = E{px). The numer¬ 
ical study is performed over planar molecules. We thus re¬ 
strict Px{u) over the molecular plane u G and normalize 
it so that jg2 Px{u)du = J^k ^k- Invariance to isometries 
and stability to deformations is therefore defined in K^. To 
understand the challenge of defining such a representation 
we begin by defining Fourier and wavelet representations 
and explain the limitations of these two approaches. We 
then motivate the use of the Scattering representation in¬ 
troduced in (Mallat, 2012), to systematically construct sta¬ 
ble invariants for regression. The extension to non-planar 
molecules in involves no mathematical difficulty. 

4.1. Fourier Invariants 

Isometry invariant Fourier type representations based on 
the bispectrum and spherical harmonics are described in 
(Bartok et al., 2010; 2013), and are used to regress poten¬ 
tial energy surfaces for the dynamics of single molecules. 
We present a Fourier representation of an arbitrary function 
p{u), which is invariant to linear isometric transformations 
of u, and discuss some of the limitations of Fourier based 
representations. 


Let p denote the Fourier transform of p\ 

p{oj) = [ du. 

The modulus of the Fourier transform is a translation in¬ 
variant representation of p. To add rotation invariance, we 
take LP averages over circles of radii 7 G K+: 

(4) 

J |u;|=7 

In order to obtain a finite dictionary, we evenly sample the 
radii up to a maximum radius R, and we build the dictio¬ 
nary out of and terms for p = 1 and p = 2: 

^f{p) = 

We use and terms to capture linear and quadratic de¬ 
pendencies in the energy functional. In particular, one can 
prove that an important part of the exact Density Functional 
E{p), namely the Haitree electron-electron repulsion func¬ 
tional, is a weighted sum of Fourier terms (4) forp = 2. 
These results extend for non-planar molecules by replacing 
the integrations in by integrations in K^. 

Numerical results from Table 1 provide the regression error 
obtained over a Fourier dictionary computed from 

the atomic density approximation px in (1). The Fourier 
dictionary can regress atomization energies to nearly 10 
kcal/mol on average, which is a relatively large error. This 
indicates that the ground state energy is not well approxi¬ 
mated in the functional space generated by the Fourier in¬ 
variants. The main reason is that the Fourier representa¬ 
tion ^f{Px) is not stable to molecular deformations. If x 
and hence px is deformed then high frequency terms will 
experience a large change in value. The same issue of in¬ 
stability to deformations appear for bispectrum represen¬ 
tations. This instability can be reduced by replacing the 
Fourier transform by a windowed Fourier transform as in 
(Bartok et al., 2013)) for the regression of potential energy 
surfaces. However, such a representation then only cap¬ 
tures localized interactions within the window size, which 
must not be too large to avoid deformation instabilities. 
Long range interactions, which represent a non-negligible 
part of the energy of many aromatic organic molecules are 
not captured by such representations. Stability to deforma¬ 
tions and capturing short range and long range interactions 
requires a multiscale representation, which motivates the 
use of a wavelet representation. 

4.2. Wavelet Invariants 

A wavelet is a complex valued function ijj having zero aver¬ 
age. We suppose, additionally, that ^ decays exponentially 
away from zero and that where ip* de¬ 

notes the complex conjugate. We utilize Moiiet wavelets. 
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defined as; 

ipiu) = exp +^ 2 /C ^ (exp(f^ui) - C). 

The slant C > 1 yields an anisotropic Gaussian which con¬ 
trols the angular sensitivity of ip. The Fourier transform of 
Ip is concentrated in a frequency domain centered at (^, 0 ), 
while the constant C is set so that J ip = 0. 

Normally, dyadic wavelets are dilated by scales 2^ for 
j € Z. We introduce a scale oversampling by a factor 2 
and dilate the wavelet at scales 2'^/^. In the following, we 
are concentrating on wavelets defined in two dimensions 
M € Two dimensional wavelets are rotated by rg for 
an angle 0 G [0, 27r). A dilated and rotated wavelet is then 
indexed as; 

= 2~^^'^^ip(2~^^^r-eu), 

and the wavelet transform is defined by 

p {p*ipj^e(u)}j,e,u- 

From the wavelet transform we derive isometry invariant 
functionals at different scales by averaging over transla¬ 
tions It G and rotations 0 G [0,27r); 

= / [ \p*'ipj,e{u)\^ d0du. (5) 

’ JR2 Jo 

As with the Fourier dictionary, the wavelet dictionary is 
made finite by utilizing a finite range of scales and various 
and functionals; 

^W{P) = 

The Fourier functionals (4) integrate the frequency energy 
of p over circles of radii 7 . The wavelet functionals, how¬ 
ever, take advantage of the multiscale structure of the en¬ 
ergy and integrate the frequency energy of p over annuli of 
bandwidth 2-^/^. Furthermore, as shown in (Mallat, 2012), 
the wavelet functionals linearize the action of diffeomor- 
phisms on p. Numerical results from Table 1 indicate that 
the wavelet dictionary requires significantly fewer coeffi¬ 
cients to achieve a comparable accuracy as the Fourier dic¬ 
tionary. However, the minimum average error remains sim¬ 
ilar, as the two dictionaries contain similar frequency infor¬ 
mation on the density p. Thus, while the wavelet dictionary 
satisfies all of the properties of Section 2.1, it is not a rich 
enough dictionary to model the energy E to state-of-the-art 
accuracy. This final limitation is resolved through the scat¬ 
tering transform. These results extend in with a wavelet 
transform over K^. 


4.3. Scattering 

The scattering transform augments the wavelet dictionary 
by providing additional multiscale invariants through iter¬ 
ated wavelet transforms. This architecture is a type of deep 
convolutional network, with variations of it applied suc¬ 
cessfully in computer vision for texture classification (Sifre 
& Mallat, 2013; Bruna & Mallat, 2013) as well as object 
classification (Oyallon & Mallat, 2014). 

Deep convolutional networks cascade linear and nonlinear 
operations through multilayer architectures (Bengio et al., 
2013). In the first layer, features from a two dimensional 
function p are extracted via a collection of functionals 
{hk}k- These functionals apply a localized linear filter Lfc 
across the function p via convolution, followed by a non¬ 
linear function F that also downsamples the signal, 

dk{p) = F{p^hk). 

The linear filters are learned from the training set via 
back-propagation. The nonlinear functions may be sig- 
moids, rectifiers or absolute values, followed by a pool¬ 
ing operator. Subsequent layers convolve linear filters both 
spatially and over the collection of functionals {hk}k from 
the previous layer, thus combining information across fil¬ 
ters. 

Wavelets {ipj^ are predefined linear filters that cap¬ 

ture information at scale 2-^1/^ and in the direction 0i. The 
complex modulus is a pointwise nonlinear function which, 
when applied after the wavelet transform, yields function¬ 
als {|p* Ilii.ei analogous to {hk}k- These function¬ 
als are invariant to translation up to scale 2^/"^. Since the 
energy is globally invariant to isometries, the wavelet in¬ 
variant functionals of (5) are derived from global 

LP averages (pooling) of {|p* over trans¬ 

lations u and rotations 0i. However, this integration re¬ 
moves the variation of \p * ipj^fi^{u)\ along (u, 0i), thus 
discarding a considerable amount of information. In order 
to recover some of this lost information we apply a second 
layer of wavelet transforms to the collection of functions 

For fixed (ji, 0i), the function u 1 —)■ |p * V’ii.Si varies 
at scales bigger than 2^/"^. Translation information at scale 
2 ^ 1 /^ and angle 0i is recovered by applying a second spatial 
wavelet transform, with the same Morlet wavelet, for scales 
j 2 > ji and angles 02 , yielding I * V'i 2 .e 2 }j 2 .e 2 - 

Applying a second wavelet transform to each of the func¬ 
tions from the first layer wavelet transform gives the fol¬ 
lowing collection of second layer spatial functions; 

The collection ( 6 ), while recovering lost spatial informa¬ 
tion, does not recover the angular variability of the first 
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layer considered as a function 01 for fixed 

ji and u. For scales on the order of the distance be¬ 
tween neighboring atoms, the behavior of these functions 
reflects the orientation of atomic bonds and hence bond an¬ 
gles. The variability over 0i for larger spatial scales indi¬ 
cates global geometric structure, such as the orientation of 
sub-molecules. This rotation information is recovered by 
applying a wavelet transform over the angles [0,27r) along 
the rotation variable 9i. The wavelet transform is defined 
in terms of circular convolution: 

p27r 

91® 92(9)= gi{e')g2{e-9')d9\ 

Jo 

Periodic dilated wavelets are defined over [0, 27r) by peri- 
odizing a one dimensional dilated Morlet wavelet '■ 

M ^ C, 

(0) = 2"^= XI 0 - 27rfc). 

fcGZ 

The resulting angular part of the second layer transform is 
then: 

Combining the second layer spatial transform ( 6 ) and the 
second layer angular transform (7), and applying the com¬ 
plex modulus, yields the following collection of functions: 

{\\p*^jl,■\*i’j2,e2A)®Ae2(^l)\}h,e^,32,S2,i2,u■ (8) 

As in the wavelet dictionary, isometry invariant function¬ 
als are derived from these second layer wavelet transforms 
via integration over translations and rotations. Note, how¬ 
ever, that a rotation of p propagates through the layers of 
the scattering transform. Indeed, if pe{u) = p{r-gu) is a 
rotation of p by angle 0 , then 

I |pe * V'A.■ I * ^ 32,92 (u) ®1p£2(^l)l = 

IIP*'>/’ 3 i,-l*'>/’ 32 , 92 - 9 (r-gu) @ - 0)1 

Thus both angular variables 0i and 02 are rotationally co¬ 
variant. However, an orthogonal change of coordinates 
( 01 , 02 ) !-)■ (a,/3), with: 

9\ — 92 01 -f 02 

yields one rotationally invariant variable a and one rota¬ 
tionally covariant variable (3. Thus isometry invariant scat¬ 
tering functionals can be derived from ( 8 ) by integrating 
over (m, /3) G X [0, 27r): 

^3l,>'2,pA) ~ 

/ / \\p*'fpji,-\*'^p32,P-Au) ®i^i2A + PW d,/3du, 

Jo 


where A 2 = ( 72 , 0 : 1 ^ 2 ) encodes the parameters of the sec¬ 
ond layer. A finite scattering dictionary is obtained by tak¬ 
ing a finite number of scales 71 , 72,^2 as well as a finite 
number of angles a, for a mixture of and function¬ 
als: 

^S(P) = {Mp),(l>3ul{p),Ajulip),(t>'^j,,2{p),--- 

■ ■ ■ ‘Pjl,X2,l (p)) (p)’ '5^7 i,A2,2(p)}7i A 2 ■ 

The second layer of the scattering transform greatly ex¬ 
pands the wavelet dictionary, but still satisfies the four 
properties of Section 2.1. Numerical experiments show 
(see Table 1) that the average regression error over the scat¬ 
tering dictionary is greatly reduced, to approximately 1.8 
kcal/mol, thus indicating that the second layer functionals 
dramatically increase the ability of the dictionary to model 
the full variability of the energy. The extension of a scatter¬ 
ing transform in is done with three-dimensional spatial 
wavelet transforms, indexed by a direction 0 which belongs 
to the two-dimensional sphere in The second order 
scattering coefficients are then calculated with a wavelet 
transform along 0 , and thus on the two-dimensional sphere 
in (Starck et al., 2006). 

5. Numerical experiments 

We compare the performance of Coulomb matrix represen¬ 
tations with Fourier, wavelet and Scattering representations 
on two databases of planar organic molecules. Molecu¬ 
lar atomization energies from these databases were com¬ 
puted using the PBEO hybrid density functional (Adamo 
& Barone, 1999). The first database includes 454 nearly 
planar molecules among the 7165 molecules of the QM7 
molecular database (Rupp et al., 2012). We also created a 
second database of 4357 strictly planar molecules, which 
we denote QM2D. We produced this new database with a 
similar procedure as the one outlined in (Rupp et al., 2012). 
Both databases consist of a set of organic molecules com¬ 
posed of hydrogen, carbon, nitrogen, oxygen, sulfur and 
additionally chlorine in the case of QM2D. The molecules 
featured in these sets cover a large spectrum of representa¬ 
tive organic groups typically found in Chemical Compound 
Space (Rupp et al., 2012). Particular care was taken in pro¬ 
ducing well-balanced folds used in cross validation assess¬ 
ments. A proper partitioning of the data points among folds 
was indeed outlined in (Hansen et al., 2013) as critical to 
ensure low variance in test errors. 

5.1. Coulomb matrix baseline and error metrics 

In the case of the Coulomb matrix based regression we 
used the best performing representation assigning eight 
randomly sorted Coulomb matrices per molecule as de¬ 
scribed in (Hansen et al., 2013). The width a of the Laplace 
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kernel and the ridge parameter A were selected following 
the algorithm described in the same paper. The algorithm 
was validated by recovering the published accuracy over 
the full QM7 database which contains 7165 molecules. In 
our experiments, we restrict the database size to only 454 
planar molecules so the regression error is larger. The 
same methodology is then followed to compute the opti¬ 
mal Coulomb matrix based regression on QM2D. 

To evaluate the precision of each regression algorithm, each 
database is broken into five representative folds, and all 
tests are performed using five fold cross validation in which 
we reserve four folds for training, and the fifth fold for test¬ 
ing. This results in a regressed energy for each molecule 
in the database. We report the Mean Absolute Error (MAE 
or over each database along with the Root Mean-Square 
Error (RMSE or which is the square root of the average 
squared error. 

5.2. Dictionary Implementations and Sizes 

The number of elements in the Eourier, wavelet, and scat¬ 
tering dictionaries are very different and respectively equal 
to 1537, 61 and 11071 in numerical computations. This 
section explains the implementation of these dictionar¬ 
ies. Molecular configurations are centered at zero, and the 
two dimensional electron density is restricted to a box 
[—a, a]^. The parameter a is chosen so that the density de¬ 
cays to nearly zero at the boundary (in our experiments, 
a = 11 angstroms). The box is then sampled with 2'^ X 2-^ 
evenly spaced grid points, for some resolution J; in prac¬ 
tice J = 10 for the Eourier and wavelet dictionaries, and 
J = 9 for the scattering dictionary. The grid naturally leads 
us to a discretized version of p^. 

Eor the Eourier representation, discrete Eourier transforms 
are computed with a two dimensional East Eourier Trans¬ 
form (EET). Integration over circles of radii 7 is approxi¬ 
mated with finite sums over discretized circular contours, 
which are approximated using the original 2“' X 2'’ spatial 
sampling. This yields 2^~^ functionals for a fixed av¬ 
erage, resulting in a dictionary of size 1 -f 3 • 2*^“^. For 
J = 10, we have 1537 dictionary elements. 

The wavelet parameters are set according to the follow¬ 
ing specifications. The minimum scale is jmin = 0 and 
the maximum scale is jmax = J — ^1“^, resulting in a to¬ 
tal of 2J scales for the wavelet transform. L angles are 
evenly sampled from [ 0 , 7 r) according to 9 = kiv/L for 
fc = 0 ,..., L — 1 , which is equivalent to evenly sampling 
2L angles over [0, 27r) since the Morlet wavelet is symmet¬ 
ric relative to both the x and y axes. In practice we take 
L = 8. The slant is C = 1/2, and the central frequency 
is fixed at ^ = 37 r/ 4 . Given a fixed slant ( and central 
frequency increasing J and L should not decrease the 
accuracy of the regression. As a consequence, we fixed 


the values for these parameters when the regression results 
appeared to plateau. Wavelet transform convolutions are 
computed as multiplications in frequency space by utiliz¬ 
ing FFTs and inverse FFTs. The modulus of the output 
functions from these convolutions are then averaged over 
the discrete spatial grid and the discrete sampling of [0 , tt) 
to obtain the wavelet dictionary functionals. This results in 
a wavelet dictionary of size 1 + 6 J. For J = 10, this yields 
61 dictionary elements 

For the second layer of the scattering transform, scales 
are computed for j 2 > ji, resulting in 2J(2J — l)/2 pairs 
(ji)/ 2 )- Angles 62 are discretized with L samples, and the 
wavelet transform over [0, tt) is computed for dyadic scales 
2^2 with £2 = 0 ,..., log 2 L. The total size of the scattering 
dictionary is then 1 -f 6 J -f 3(2 J(2 J — \)L log 2 T)/2. For 
J = 9 and L = 8 , the scattering dictionary contains 11071 
functionals. 

6. Results and discussion 

The results from our numerical experiments are summa¬ 
rized in Table 1. The Scattering representation computed 
from the atomic density model offers state-of-the-art accu¬ 
racy. Over the larger database, it has a 25% improvement in 
Mean Absolute Error over the best Coulomb matrix based 
regression technique. This improvement goes up to 53% 
in Root Mean Square Error. The larger improvement of 
the RMSE is due to the fact that a scattering regression has 
smaller error outliers. 



Figure 2. Decay of the log RMSE error 
I loga (\f{x) - /m( 2 ;)P)] over the larger database of 
4357 molecules, as a function of log 2 (M) in the Fourier (green), 
Wavelet (blue) and Scattering (red) regressions. The dotted line 
gives the Coulomb regression error for reference. 


Table 1 shows that the error of Fourier and wavelet regres¬ 
sion are of the same order although the Fourier dictionary 
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Table 1. Error in kcal/mol of regressed quantum molecular energies using different molecular representations (vertically) and different 
error measures (horizontally), on two databases of planar organic molecules: left and right parts of the table. 



454 2D molecules from QM7 

4357 molecules in QM2D 

M 

fF MAE 

RMSE 

M 

fF MAE 

e^: RMSE 

Coulomb 

N/A 

7.0 

20.5 

N/A 

2.4 

5.8 

Fourier 

62 

11.9 

16.1 

198 

11.1 

16.7 

Wavelet 

42 

11.1 

15.5 

59 

11.1 

14.2 

Scattering 

74 

6.9 

9.0 

591 

1.8 

2.7 

Chemical Accuracy 

1.0 


has 1537 elements and the wavelet dictionary has only 61. 
Figure 2 gives the decay of these errors as a function of 
M. This exepected error is computed on testing molecules. 
The circles on the plot give the estimated value of M which 
yield a minimum regression error by cross-validation over 
the training set (reported in Table 1). Although the Fourier 
and wavelet regressions reach nearly the same minimum 
error, the decay is much faster for wavelets. When going 
from the smaller to the larger database, the minimum error 
of the Fourier and wavelet regressions remain nearly the 
same. This shows that the bias error due to the inability of 
these dictionaries to precisely regress f{x) is dominating 
the variance error corresponding to errors on the regression 
coefficients. The Coulomb and Scattering representations 
on the other hand, achieve much smaller bias errors on the 
larger database. 

The number of terms of the scattering regression is M = 
591 on the larger database, although the dictionary size is 
11071. A very small proportion of scattering invariants are 
therefore selected to perform this regression. The chosen 
scattering coefficients used for the regression are coeffi¬ 
cients corresponding to scales which fall between the min¬ 
imum and maximum pairwise distances between atoms in 
the molecular database. These selected coefficients are thus 
adapted to the molecular geometries. 

7. Conclusion 

This paper introduced a novel intermediate molecular rep¬ 
resentation through the use of a model electron density. 
The regression is performed on a scattering transform ap¬ 
plied to a model density built from a linear superposition of 


atomic densities. This transform is well adapted to quan¬ 
tum energy regressions because it is invariant to the per¬ 
mutation of atom indices, to isometric transformations, it 
is stable to deformations, and it separates multiscale inter¬ 
actions. It is computed with a cascade of wavelet convolu¬ 
tions and modulus non-linearities, as a deep convolutional 
network. State-of-the-art regression accuracy is obtained 
over two databases of two-dimensional organic molecules, 
with a relatively small number of scattering vectors. Under¬ 
standing the relation between the choice of scattering coef¬ 
ficients and the physical and chemical properties of these 
molecules is an important issue. 

Numerical applications have been carried out over pla¬ 
nar molecules, which allows one to restrict the electronic 
density to the molecular plane, and thus compute a two- 
dimensional scattering transform. A scattering transform 
is similarly defined in three dimensions, with the same in¬ 
variance and stability properties. It involves computing a 
wavelet transform on the two-dimensional sphere S'^ in 
(Starck et al., 2006) as opposed to the circle S^. It entails 
no mathematical difficulty, but requires appropriate soft¬ 
ware implementations which are being carried out. 

Energy regressions can also provide estimations of forces 
through differentiations with respect to atomic positions. 
Scattering functions are differentiable and their differential 
can be computed analytically. However, the precision of 
such estimations remain to be established. 
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